White Wine Quality provided by Udacity
This tidy data set contains 4,898 white wines with 11 variables quantifying the chemical properties of each wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent).
Which chemical properties influence the quality of white wines?
This dataset contains information about white wine quality, with each wine rated by experts. I have limited knowledge of wines, but it would be really interesting to see how the chemical composition of white wines relates to their quality. Through this exploration I would like to gain some insight into which chemical properties make a wine taste better. This exploratory analysis could be of use to wine makers.
Here we explore the dataset as follows
## [1] "No of data points: 4898"
## [1] "No of features: 13"
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
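The inspection output above could be produced with calls along the following lines. This is a minimal sketch rather than the original code: it rebuilds only the first five rows and a subset of columns from the values printed above, and the file name in the commented `read.csv` call is a hypothetical placeholder.

```r
# First five rows of a few columns, copied from the str() output above.
wine <- data.frame(
  X              = 1:5,
  fixed.acidity  = c(7.0, 6.3, 8.1, 7.2, 7.2),
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5),
  density        = c(1.001, 0.994, 0.995, 0.996, 0.996),
  alcohol        = c(8.8, 9.5, 10.1, 9.9, 9.9),
  quality        = c(6L, 6L, 6L, 6L, 6L)
)

# With the full file one would read it in instead, for example:
# wine <- read.csv("white_wines.csv")  # hypothetical file name

n_points   <- nrow(wine)   # number of observations (rows)
n_features <- ncol(wine)   # number of variables (columns)

print(paste("No of data points:", n_points))
print(paste("No of features:", n_features))
print(names(wine))         # column names
str(wine)                  # type and first values of each column
```

On the full data set the same calls would report 4898 rows and 13 columns, as shown above.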
The variables above describe the various chemical properties of white wines. Now that we have seen the variables, I would like to plot some of them one by one to see what insights they can give.
Here I explore the univariate plots. As part of the univariate analysis I would like to explore the various features of the wine dataset, look at their patterns, and find something insightful.
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
## alcohol quality
## Min. : 8.00 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.40 Median :6.000
## Mean :10.51 Mean :5.878
## 3rd Qu.:11.40 3rd Qu.:6.000
## Max. :14.20 Max. :9.000
I feel that the density of a wine and its alcohol content are the most important properties. But before I explore them, I would like to explore the quality rating given to each wine by the experts and its distribution.
The bar chart here shows that most wines received a rating of 6, and that the bulk of the ratings fall in the range 5-7. The distribution of quality is roughly bell-shaped (approximately normal), as the bar plot shows.
##
## 3 4 5 6 7 8 9
## 20 163 1457 2198 880 175 5
## [1] 5.877909
As expected, the average quality rating is about 5.88.
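The frequency table and the mean can be checked directly from the counts printed above. The sketch below rebuilds the rating vector from those counts; the variable names are illustrative and not taken from the original analysis.

```r
# Counts per quality rating, copied from the table above.
counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
            "7" = 880, "8" = 175, "9" = 5)

# Expand the counts into one rating per wine.
quality <- rep(as.integer(names(counts)), times = counts)

quality_table <- table(quality)   # same frequencies as above
mean_quality  <- mean(quality)    # about 5.8779

# Share of wines rated 5, 6 or 7.
share_5_to_7 <- sum(counts[c("5", "6", "7")]) / sum(counts)

print(quality_table)
print(mean_quality)
print(round(share_5_to_7, 3))
```

Roughly 93% of the wines fall in the 5-7 range, which is why the bar chart looks so concentrated around 6.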
The above box plot shows that most of the wines have a density of around 0.995.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390
To ground this further, the summary statistics show a median density of 0.9937 and a mean of 0.9940, with the interquartile range (between the 1st and 3rd quartiles) running from 0.9917 to 0.9961.
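As a rough check on what the box plot flags as outliers, the usual 1.5 x IQR rule can be applied to the rounded quartiles from the summary above. This is a sketch under that assumption; the fences a box plot computes from the raw data may differ slightly because of rounding.

```r
# Rounded summary statistics for density, copied from the output above.
d_min <- 0.9871
q1    <- 0.9917
q3    <- 0.9961
d_max <- 1.0390

iqr         <- q3 - q1          # width of the middle 50%: 0.0044
lower_fence <- q1 - 1.5 * iqr   # 0.9851
upper_fence <- q3 + 1.5 * iqr   # 1.0027

# Any value beyond a fence would be drawn as an outlier point.
has_low_outliers  <- d_min < lower_fence
has_high_outliers <- d_max > upper_fence

print(c(lower = lower_fence, upper = upper_fence))
print(c(low = has_low_outliers, high = has_high_outliers))
```

Under this rule the minimum (0.9871) lies inside the lower fence, while the maximum (1.0390) lies well above the upper fence, so the outliers are all on the high-density side.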
Now I want to look at the alcohol content of white wines. No outliers can be seen in this distribution. Most of the wines have an alcohol content between about 9.5 and 11.5, although this interquartile range is fairly wide. It would be really interesting to see how alcohol varies with quality or density in the bivariate analysis.
As far as I know, wine experts use their senses to judge wines: sight, smell and taste. The different chemical components account for the different sensations. For example, residual sugar gives sweetness, citric acid is related to freshness, and acid or tannin gives an astringent taste. So I'm interested in citric acid, residual sugar, and fixed acidity.
First, I would like to explore the fixed acidity feature of the dataset.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.800 6.300 6.800 6.855 7.300 14.200
Notice that the fixed acidity histogram is also approximately normal. Most white wines have a fixed acidity of about 6 to 7 g/dm^3.
I would now explore the citric acid content of the wines.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2700 0.3200 0.3342 0.3900 1.6600
The citric acid distribution looks approximately normal. Most white wines have about 0.3 g/dm^3 of citric acid. There is an interesting peak near 0.5 g/dm^3; I wonder why.
I would now explore the residual sugar content of the wines.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 5.200 6.391 9.900 65.800
The residual sugar distribution is right-skewed. The largest spike is at 1 to 2 g/dm^3, and the distribution shows that very sweet wines are rare.
Next I want to see the distributions of volatile acidity (to see how it differs from fixed acidity), pH (to see how acidic the wines are overall), and chlorides (to see the salt content).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.2100 0.2600 0.2782 0.3200 1.1000
Volatile acidity looks approximately normal. Most white wines have a volatile acidity of about 0.2 to 0.3 (median 0.26).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.090 3.180 3.188 3.280 3.820
pH is also approximately normally distributed. The median pH of the wines in the dataset is 3.18.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600
Chlorides are also approximately normally distributed up to a content of 0.1, but a small number of data points lie above 0.1 and form a long tail. Most of the wines have a chloride content of about 0.045.
Finally as part of the univariate plots, I want to see the distribution of sulphate values and sulphur dioxide content.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2200 0.4100 0.4700 0.4898 0.5500 1.0800
Sulphates look approximately normal. Most white wines have a sulphate level of about 0.5 (median 0.47).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 23.00 34.00 35.31 46.00 289.00
Free sulfur dioxide looks approximately normal. Most white wines have a free sulfur dioxide level of about 34 (the median).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 108.0 134.0 138.4 167.0 440.0
Total sulfur dioxide looks approximately normal. Most white wines have a total sulfur dioxide level of about 130 (median 134).
There are 4898 observations and 13 variables: a row index (X), 11 chemical input features of the white wines, and one output variable, wine quality. Quality is an integer variable with a minimum of 3 and a maximum of 9, a median of 6 and a mean of 5.878.
All the chemical property variables are floating-point numbers. They are measured in different units and therefore lie in widely different ranges. For example, the chlorides variable has a small range from 0.009 to 0.346, while the total.sulfur.dioxide variable has a large range from 9.0 to 440.0.
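Because the variables sit on such different scales, it can help to standardize them before comparing them side by side or feeding them to a model. The sketch below applies base R's `scale()` to the first ten values of two columns from the `str()` output above; this step is an illustration and was not part of the analysis itself.

```r
# First ten values of two columns, copied from the str() output above.
chlorides <- c(0.045, 0.049, 0.050, 0.058, 0.058,
               0.050, 0.045, 0.045, 0.049, 0.044)
total.sulfur.dioxide <- c(170, 132, 97, 186, 186,
                          97, 136, 170, 132, 129)

raw <- data.frame(chlorides, total.sulfur.dioxide)

# Raw ranges differ by several orders of magnitude.
print(sapply(raw, range))

# scale() centers each column to mean 0 and rescales it to sd 1.
scaled <- as.data.frame(scale(raw))

print(round(colMeans(scaled), 10))   # both approximately 0
print(sapply(scaled, sd))            # both 1
```

After standardization both columns are expressed in standard deviations from their own mean, so they can be compared directly.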
The main features in the data set are alcohol and quality. I suspect that alcohol, in combination with some of the other variables, can be used to build a predictive model of wine quality. I would like to explore pairs of variables in the bivariate analysis.
Features such as residual sugar, sulphates, pH and chlorides will likely contribute to wine quality and will support our investigation.
No. So far I haven't created any new variables, as the data set already seems tidy.
During the investigation, I found that the chlorides variable has an unusual distribution. From the histogram of chlorides, we see that the majority of samples lie in the range [0, 0.1] in a roughly normal shape, but a small number of outliers lie far beyond this range (up to 0.346), which indicates a long-tailed distribution.
To better visualize this distribution, I cut off the samples beyond 0.1 and “zoom in” on those in the “regular range”. All three plots individually show an approximately normal distribution.
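The cut-off described above can be sketched in base R as follows. The vector is a small illustrative sample: most values come from the summary and `str()` output above, while 0.12 and 0.21 are hypothetical tail values added only to show the split.

```r
# Illustrative chlorides sample. 0.12 and 0.21 are hypothetical;
# the remaining values appear in the summary or str() output above.
chlorides <- c(0.009, 0.036, 0.043, 0.045, 0.050,
               0.058, 0.12, 0.21, 0.346)

cutoff <- 0.1

regular     <- chlorides[chlorides <= cutoff]   # the "regular range"
tail_values <- chlorides[chlorides >  cutoff]   # the long tail

# Bin the regular range only. Dropping plot = FALSE draws the histogram.
h <- hist(regular, breaks = seq(0, 0.1, by = 0.01), plot = FALSE)

print(length(regular))
print(tail_values)
print(h$counts)
```

Restricting the histogram to the regular range keeps the few extreme values from stretching the x-axis and hiding the shape of the main body of the distribution.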
Now that I have explored some individual variables, I would like to know their relationships with each other. We start with the Bivariate plot section next.
I wish to know if there is any correlation between various features.
## X fixed.acidity volatile.acidity
## X 1.000000000 -0.25581431 0.002857966
## fixed.acidity -0.255814305 1.00000000 -0.022697290
## volatile.acidity 0.002857966 -0.02269729 1.000000000
## citric.acid -0.149899918 0.28918070 -0.149471811
## residual.sugar 0.006623775 0.08902070 0.064286060
## chlorides -0.045645192 0.02308564 0.070511571
## free.sulfur.dioxide -0.011928911 -0.04939586 -0.097011939
## total.sulfur.dioxide -0.161979037 0.09106976 0.089260504
## density -0.185976097 0.26533101 0.027113845
## pH -0.115774132 -0.42585829 -0.031915368
## sulphates 0.009807759 -0.01714299 -0.035728147
## quality 0.035763247 -0.11366283 -0.194722969
## citric.acid residual.sugar chlorides
## X -0.149899918 0.006623775 -0.04564519
## fixed.acidity 0.289180698 0.089020701 0.02308564
## volatile.acidity -0.149471811 0.064286060 0.07051157
## citric.acid 1.000000000 0.094211624 0.11436445
## residual.sugar 0.094211624 1.000000000 0.08868454
## chlorides 0.114364448 0.088684536 1.00000000
## free.sulfur.dioxide 0.094077221 0.299098354 0.10139235
## total.sulfur.dioxide 0.121130798 0.401439311 0.19891030
## density 0.149502571 0.838966455 0.25721132
## pH -0.163748211 -0.194133454 -0.09043946
## sulphates 0.062330940 -0.026664366 0.01676288
## quality -0.009209091 -0.097576829 -0.20993441
## free.sulfur.dioxide total.sulfur.dioxide density
## X -0.0119289106 -0.161979037 -0.18597610
## fixed.acidity -0.0493958591 0.091069756 0.26533101
## volatile.acidity -0.0970119393 0.089260504 0.02711385
## citric.acid 0.0940772210 0.121130798 0.14950257
## residual.sugar 0.2990983537 0.401439311 0.83896645
## chlorides 0.1013923521 0.198910300 0.25721132
## free.sulfur.dioxide 1.0000000000 0.615500965 0.29421041
## total.sulfur.dioxide 0.6155009650 1.000000000 0.52988132
## density 0.2942104109 0.529881324 1.00000000
## pH -0.0006177961 0.002320972 -0.09359149
## sulphates 0.0592172458 0.134562367 0.07449315
## quality 0.0081580671 -0.174737218 -0.30712331
## pH sulphates quality
## X -0.1157741316 0.009807759 0.035763247
## fixed.acidity -0.4258582910 -0.017142985 -0.113662831
## volatile.acidity -0.0319153683 -0.035728147 -0.194722969
## citric.acid -0.1637482114 0.062330940 -0.009209091
## residual.sugar -0.1941334540 -0.026664366 -0.097576829
## chlorides -0.0904394560 0.016762884 -0.209934411
## free.sulfur.dioxide -0.0006177961 0.059217246 0.008158067
## total.sulfur.dioxide 0.0023209718 0.134562367 -0.174737218
## density -0.0935914935 0.074493149 -0.307123313
## pH 1.0000000000 0.155951497 0.099427246
## sulphates 0.1559514973 1.000000000 0.053677877
## quality 0.0994272457 0.053677877 1.000000000
High correlations (at least 0.4 in absolute value) are identified and marked in red. Pairwise scatterplots are also shown below.
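One way to pick out the high correlations programmatically is to filter the upper triangle of the correlation matrix. The sketch below uses five of the variables, with coefficients rounded to four decimals from the matrix above; on the full data the matrix would come from a call such as `cor(wine)`, where `wine` is an assumed name for the data frame.

```r
vars <- c("residual.sugar", "free.sulfur.dioxide",
          "total.sulfur.dioxide", "density", "quality")

# Start from the identity (each variable correlates 1 with itself).
r <- diag(length(vars))
dimnames(r) <- list(vars, vars)

# Fill the upper triangle column by column with values from the matrix above.
r[upper.tri(r)] <- c(
  0.2991,                              # sugar-free SO2
  0.4014, 0.6155,                      # sugar-total SO2, free-total SO2
  0.8390, 0.2942, 0.5299,              # sugar, free SO2, total SO2 vs density
  -0.0976, 0.0082, -0.1747, -0.3071    # each of the four vs quality
)
# Mirror the upper triangle into the lower triangle.
r[lower.tri(r)] <- t(r)[lower.tri(r)]

# Keep each pair once (upper triangle) where |r| is at least 0.4.
idx <- which(abs(r) >= 0.4 & upper.tri(r), arr.ind = TRUE)
strong <- data.frame(
  var1 = vars[idx[, "row"]],
  var2 = vars[idx[, "col"]],
  r    = r[idx]
)

print(strong)
```

Among these five variables the filter keeps four pairs, the strongest being residual sugar and density. None of the four chemical variables reaches the threshold against quality.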
Higher quality wines seem to have lower levels of total sulphur dioxide, as the median value falls as quality increases. The highest rated wines have the lowest total sulphur dioxide content.
Higher quality wines seem to have higher levels of alcohol, as the median value rises consistently as quality increases. The highest rated wines have the highest alcohol content.
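The per-rating medians behind statements like these can be computed with `tapply`. The numbers below are hypothetical toy values chosen only to show the mechanics; they are not taken from the data set.

```r
# Hypothetical toy data: three wines at each of three quality ratings.
toy <- data.frame(
  quality = c(5, 5, 5, 6, 6, 6, 7, 7, 7),
  alcohol = c(9.2, 9.5, 9.8, 10.1, 10.4, 10.9, 11.2, 11.6, 12.0)
)

# Median alcohol content within each quality rating.
median_by_quality <- tapply(toy$alcohol, toy$quality, median)

# TRUE when the median rises with every step up in quality.
is_increasing <- all(diff(as.numeric(median_by_quality)) > 0)

print(median_by_quality)
print(is_increasing)

# boxplot(alcohol ~ quality, data = toy) would draw the grouped box plot.
```

Applied to the real data, the same call with the total sulphur dioxide column in place of alcohol would give the medians for the first plot described above.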
Density and residual sugar are strongly positively correlated (r = 0.84 in the matrix above). This makes sense: dissolved sugar adds mass to a given volume of wine, and since density = mass/volume, more residual sugar means a higher density.
The ratio of residual sugar to citric acid seems to play a large role in quality. This can be explained by the fact that good quality wines are crisp and dry. Check the link in the references for further explanation.
I looked at three relationships: quality vs. the ratio of residual sugar to citric acid, quality vs. total sulphur dioxide, and quality vs. alcohol. Quality seemed to be positively correlated with alcohol content, while there was a negative correlation between quality and total sulphur dioxide (r = -0.17 in the matrix above). The ratio of residual sugar to citric acid seems to play an important role because these two components directly affect how crisp and dry a wine tastes. Good wines tend to be crisp and dry, so they have high acidity and lower sugar levels. You can check the references for more information about the ‘crispness’ of wines.
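A ratio like this can be added as a derived column. The sketch below uses the first ten rows from the `str()` output above; the column name `sugar.acid.ratio` is an illustrative choice, and the guard against zero citric acid is an assumption made here because the summary shows a minimum of 0 for that variable.

```r
# First ten rows of the two columns, copied from the str() output above.
wine <- data.frame(
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7.0, 20.7, 1.6, 1.5),
  citric.acid    = c(0.36, 0.34, 0.40, 0.32, 0.32, 0.40, 0.16, 0.36, 0.34, 0.43)
)

# Residual sugar per unit of citric acid. Wines with no citric acid
# get NA instead of an infinite ratio.
wine$sugar.acid.ratio <- ifelse(
  wine$citric.acid > 0,
  wine$residual.sugar / wine$citric.acid,
  NA_real_
)

print(round(wine$sugar.acid.ratio, 2))
```

A lower ratio means less sugar relative to acid, which is the direction the discussion above associates with crisp, dry wines.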
All plots were as expected, so nothing was extraordinary. The relationship between density and residual sugar was quite straightforward: higher sugar levels increase the mass per unit volume, and therefore the density.
Quality and alcohol seem to be positively correlated: the higher the alcohol content, the higher the quality rating tends to be.
Next we will explore the interaction between multiple variables.
The pH indicates whether a wine is acidic or alkaline. For a given amount of citric acid, pH seems to increase with alcohol content. Citric acid on its own is weakly negatively correlated with pH (r = -0.16 in the matrix above), as one would expect of an acid. Both properties relate to how crisp and dry the wine tastes.
It can be seen clearly that high quality wines tend to be less sweet and more crisp. This is due to higher levels of citric acid and less sugar, which makes the wine drier. Also, alcohol level is positively correlated with the quality of wine.
I looked at the relationship between acidity (pH), citric acid and alcohol. I found that pH seemed to increase with alcohol when the citric acid level is held fixed.
Yes. I found that good quality wines seemed to have lower levels of sugar and higher levels of alcohol. Good quality wines also seem to have a good ratio of citric acid to residual sugar, which keeps the wine crisp and dry.
This plot suggests that the majority of wines have a rating of 5, 6 or 7. The ratings also follow an approximately normal distribution.
Higher quality wines seem to have higher levels of alcohol, as the median value rises consistently as quality increases. The highest rated wines have the highest alcohol content.
Sulphur dioxide is used as a preserving agent in wine. However, its presence produces a pungent aroma, which is undesirable. Higher quality wines seem to have lower levels of total sulphur dioxide, as the median value falls consistently as quality increases. The highest rated wines have the lowest total sulphur dioxide. Check the references for more information on sulphur dioxide.
The dataset was quite large and interesting, and I learnt a great deal about wines from the analysis. I found several factors that affect the quality of wine, including alcohol, sulphur dioxide, residual sugar and citric acid. The higher the level of alcohol, the better the wine. The opposite is true for sulphur dioxide, which negatively affects the quality of wine. Good quality wines seem to maintain a good ratio of citric acid to sugar, which keeps the wine crisp and dry.
Initially I had great trouble understanding the different factors, and I had to study the fermentation process to understand them better. I believe that with more knowledge of chemistry I could have improved my analysis. Some feature engineering would definitely have helped as well. In future work, a machine learning model could be used to predict quality, which would also help in understanding the relationship between wine quality and the various factors.